I.  INTRODUCTION

In any form of scientific research or decision making, it is desirable
to draw upon all relevant data which is available. Unfortunately,
data derived from different sources often takes on forms which are
incompatible. Consequently, much of the information is often not used and
is thereby effectively "lost."

Simulation users frequently find themselves in this situation when
observations have been obtained both from a computer model and from the
corresponding real-world situation it simulates. Although the real-world
observations comprise the more valid of the two data sets, the other set
may also contain useful information.

In general, real-world (experimental) observations are subject to
statistical variation. Simulation outputs from a model of the same
situation not only contain statistical variation, but also may be clouded by
possible model inadequacies. If the model is a valid representation of
the corresponding real-world situation, then the two types of data
(simulation and experimental) are of equal value. However, if model validity
is in question, the simulation data may be of less value than the
experimental data. This raises the issue of model validation, which has been
discussed by a number of authors (e.g., [1], [2], [3], [4], [5]). The
validation task is to compare the simulation data with that of the
real-world system, usually by means of a hypothesis test.

This paper does not deal with validation, per se, but rather with a
topic we label "data integration." With data integration, we aren't
interested in a yes/no decision about whether or not the simulation is valid.
Instead, we are concerned with whether the simulation data is useful. In
other words, we want to determine how we can best use the simulation data
to supplement the real-world observations. Thus, the focus of data
integration is to determine the best procedure for combining data obtained from
a simulation and from the corresponding real-world situation.

In this paper, we assume that we are dealing with situations in which
one simulation run yields a single response vector, rather than a time
series vector. Thus, we are restricting our attention to terminating
simulations. Figure 1 illustrates this framework, in which we have two data
sources, each of which generates a response vector which is a function of
$\mathbf{x}$, a vector of input variables.

For purposes of this paper, we assume that each observation is produced
under identical conditions, i.e., for a specific value
$\mathbf{x} = \mathbf{x}_0$. In view of this, we will suppress the
dependence of the response on $\mathbf{x}$ in the ensuing discussion. The
real-world response $\mathbf{y}$ estimates some unknown parameter vector
$\boldsymbol{\mu}$, contaminated by random error. The simulation response
$\mathbf{w}$, which is supposed to estimate $\boldsymbol{\mu}$, is not only
affected by random error, but also may be biased because of inaccuracies in
the simulation model. Our data integration goal is to obtain the most
accurate estimate of $\boldsymbol{\mu}$ that we can.


Figure  1:  Illustration  of  the  Problem  Framework 


II.  PROBLEM  DISCUSSION 


In this paper we examine the univariate case. Specifically, consider
the situation in which a sample of $n$ independent real-world observations
$y_1, \ldots, y_n$ is observed, where $y_i \sim N(\mu, \sigma^2)$ and the
object is to estimate $\mu$. Assume that in addition to, and independent
of, the $y_i$'s, $m$ independent observations $w_1, \ldots, w_m$ are
available from a simulation model for which
$w_i \sim N(\mu + \Delta\sigma, \sigma^2)$. Thus, if $\Delta \neq 0$, the
simulation data contains a bias. As an aside, we should note that, in
general, $n < m$ and $n$ is usually quite small because of the difficulty
and/or expense of obtaining real-world observations.

Suppose we decide to estimate $\mu$ by using an estimator of the form

    $\hat{\mu}_p = p\bar{y} + (1-p)\bar{w}$,

where $\bar{y}$ and $\bar{w}$ are the two sample means, so that the
estimator is a weighted average of the real-world responses and the
simulation responses. The pooled mean, in which each observation is
weighted equally, is obtained when $p = p^* = n/(n+m)$, but this estimator
would be optimal only if the simulation model were valid, i.e., if
$\Delta = 0$. Since the assumption of a simulation model being valid is at
best tenuous, we will examine what happens for values of $\Delta \neq 0$,
adopting mean square error (MSE) as the measure of the goodness of an
estimator.

For general $p$, where $p$ is a constant,

    $\mathrm{MSE}(\hat{\mu}_p) = p^2\sigma^2/n + (1-p)^2[(\sigma^2/m) + \Delta^2\sigma^2]$.    (1)
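
Equation (1) can be checked with the usual decomposition of MSE into variance plus squared bias. The steps below are a reconstruction from the model stated above (independent samples, simulation bias $\Delta\sigma$), not a derivation given in the original text.

```latex
\begin{align*}
\operatorname{Var}(\hat{\mu}_p)
  &= p^2\,\frac{\sigma^2}{n} + (1-p)^2\,\frac{\sigma^2}{m}
  && \text{(independence of } \bar{y} \text{ and } \bar{w}\text{)} \\
E(\hat{\mu}_p) - \mu
  &= p\mu + (1-p)(\mu + \Delta\sigma) - \mu = (1-p)\,\Delta\sigma \\
\operatorname{MSE}(\hat{\mu}_p)
  &= \operatorname{Var}(\hat{\mu}_p) + \bigl[E(\hat{\mu}_p) - \mu\bigr]^2
   = \frac{p^2\sigma^2}{n} + (1-p)^2\Bigl[\frac{\sigma^2}{m} + \Delta^2\sigma^2\Bigr]
\end{align*}
```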

If we chose always to use only the real-world data, then our estimator
would be $\hat{\mu}_1$, which is nothing more than $\bar{y}$. For this
estimator

    $\mathrm{MSE}(\hat{\mu}_1) = \sigma^2/n$.    (2)


On the other hand, if we chose always to use both data sources, giving
each observation equal weight, our estimator would be the pooled mean

    $\hat{\mu}_{p^*} = (n\bar{y} + m\bar{w})/(n+m)$.

The corresponding MSE is

    $\mathrm{MSE}(\hat{\mu}_{p^*}) = \sigma^2(n + m + m^2\Delta^2)/(n+m)^2$.    (3)

As one would expect, for values of $\Delta$ close to zero, the estimator
$\hat{\mu}_{p^*}$ based on both sets of observations would provide a smaller
MSE than that resulting from the use of the real-world data alone. In fact,
we can see from equations (2) and (3) that the use of $\hat{\mu}_{p^*}$
provides better performance (i.e., smaller MSE) so long as
$|\Delta| < [(n+m)/nm]^{1/2}$. However, for larger values of $|\Delta|$,
the inflation in MSE rapidly becomes catastrophic; the MSE is unbounded as
$|\Delta| \to \infty$.
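
The comparison between equations (2) and (3) can be made concrete with a short numerical sketch. The function names below are illustrative choices, not part of the original paper, and $\sigma$ is set to 1 by default because the comparison does not depend on it.

```python
import math

def mse_weighted(p, n, m, delta, sigma=1.0):
    """Equation (1): MSE of p*ybar + (1-p)*wbar for a fixed weight p."""
    return p**2 * sigma**2 / n + (1 - p)**2 * (sigma**2 / m + delta**2 * sigma**2)

def mse_real_only(n, sigma=1.0):
    """Equation (2): MSE of ybar alone."""
    return sigma**2 / n

def mse_pooled(n, m, delta, sigma=1.0):
    """Equation (3): MSE of the equally weighted pooled mean."""
    return sigma**2 * (n + m + m**2 * delta**2) / (n + m)**2

def break_even_delta(n, m):
    """|Delta| at which the pooled mean and ybar have the same MSE."""
    return math.sqrt((n + m) / (n * m))

# The two sample-size cases studied later in the paper.
for n, m in [(3, 10), (3, 50)]:
    d0 = break_even_delta(n, m)
    ratio_at_zero = mse_pooled(n, m, 0.0) / mse_real_only(n)
    print(f"n={n}, m={m}: break-even |Delta| = {d0:.3f}, "
          f"relative MSE at Delta=0 = {ratio_at_zero:.3f}")
```

Below the break-even value the pooled mean has the smaller MSE; above it, the ratio grows without bound in $\Delta^2$.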

Of course, we could avoid such catastrophic results by never using the
simulation observations, i.e., by always using the estimator
$\hat{\mu}_1 = \bar{y}$. However, by adopting this conservative minimax
strategy, we would deprive ourselves of the opportunity to obtain much
better estimates when $\Delta$ is small.


III.  TWO  DATA  INTEGRATION  APPROACHES 




One approach to this problem is to return to the validation framework,
and use the estimator $\hat{\mu}_{p^*}$ if the simulation model is judged
valid or the estimator $\hat{\mu}_1$ otherwise. As mentioned previously, a
judgement about the validity of a simulation model is usually based on a
hypothesis test. In the situation being discussed, the appropriate
hypothesis test would involve the hypotheses

    $H_0: \Delta = 0$   versus   $H_1: \Delta \neq 0$

based on the t-statistic

    $t = [nm/(n+m)]^{1/2}\,(\bar{y} - \bar{w})/s$

where $s$ denotes the pooled estimated standard deviation. The rejection
region (assuming a significance level of $\alpha$) would be

    $|t| > t_{\alpha/2,\,n+m-2}$

where $t_{\alpha/2,\,n+m-2}$ denotes the upper $\alpha/2$ point of a
t-distribution with $n+m-2$ degrees of freedom.


If $H_0$ were rejected, the estimator $\hat{\mu}_1 = \bar{y}$ would be used,
while if it were not rejected, the estimator
$\hat{\mu}_{p^*} = (n\bar{y} + m\bar{w})/(n+m)$ would be used. It should be
noted that an estimate of $\Delta$ is given by

    $\hat{\Delta} = (\bar{y} - \bar{w})/s$.


Therefore, the validation approach results in the use of the estimator

    $\tilde{\mu} = \hat{\mu}_{p^*}$   if $|\hat{\Delta}| \le [(n+m)/nm]^{1/2}\, t_{\alpha/2,\,n+m-2}$
    $\tilde{\mu} = \hat{\mu}_1$       if $|\hat{\Delta}| > [(n+m)/nm]^{1/2}\, t_{\alpha/2,\,n+m-2}$

where, of course, the value of $\alpha$ must be specified. This estimator,
it will be noted, is based on an all or nothing rule: if $|\hat{\Delta}|$
is too large, only the real-world observations are used, whereas if
$|\hat{\Delta}|$ is not too large, all observations are used. Thus, a
simulation observation is given weight zero or equal weight with each
real-world observation, depending upon the size of $\hat{\Delta}$.
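
The all or nothing rule can be sketched in a few lines. This is a minimal illustration, not the authors' program: the function name is hypothetical, the critical value `t_crit` is supplied by the caller (for example 2.201, the tabulated upper 0.025 point of a t-distribution with 11 degrees of freedom), and the pooled variance is assumed to be the usual two-sample form with $n+m-2$ degrees of freedom.

```python
import math
from statistics import fmean

def validity_test_estimate(y, w, t_crit):
    """Return ybar if the t-test rejects equal means, else the pooled mean.

    y: real-world observations, w: simulation observations,
    t_crit: upper alpha/2 point of t with n+m-2 degrees of freedom
            (passed in by the caller; no t-table lookup is done here).
    """
    n, m = len(y), len(w)
    ybar, wbar = fmean(y), fmean(w)
    # Pooled sum of squares about each sample's own mean (assumed form of s).
    ss = sum((v - ybar) ** 2 for v in y) + sum((v - wbar) ** 2 for v in w)
    s = math.sqrt(ss / (n + m - 2))
    # Estimate of Delta as written in the text; only its magnitude is used.
    delta_hat = (ybar - wbar) / s
    threshold = math.sqrt((n + m) / (n * m)) * t_crit
    if abs(delta_hat) > threshold:
        return ybar                          # simulation data discarded
    return (n * ybar + m * wbar) / (n + m)   # all observations pooled

# Hypothetical data, for illustration only.
y_obs = [1.0, 2.0, 3.0]
w_close = [2.0, 3.0] * 5
w_biased = [102.0, 103.0] * 5
print(validity_test_estimate(y_obs, w_close, 2.201))
print(validity_test_estimate(y_obs, w_biased, 2.201))
```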

A more flexible procedure would incorporate $\hat{\Delta}$ directly into the
estimate. Suppose, therefore, in view of the fact we are unwilling to accept
the assumption of $\Delta = 0$, we attempt to determine an adaptive method
for incorporating information about $\Delta$ into the estimator of $\mu$.
Using equation (1), we can see that by setting

    $\partial\,\mathrm{MSE}(\hat{\mu}_p)/\partial p = 0$,

we find that

    $p = (n + nm\Delta^2)/(n + m + nm\Delta^2)$    (4)


provides the minimum MSE. It should be noted that if $\Delta = 0$,
$p = n/(n+m)$ so that $\hat{\mu}_p$ reduces to the weighted average which we
would use if it were assumed that the simulation observations should be
given the same weight as the real-world observations.
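
The algebra behind equation (4) is short enough to write out. The steps below are a reconstruction from equation (1); the last line, the MSE attained when $\Delta$ is known, is an added consequence of (1) and (4) rather than a formula stated in the original text.

```latex
\begin{align*}
\frac{\partial\,\mathrm{MSE}(\hat{\mu}_p)}{\partial p}
  &= \frac{2p\sigma^2}{n} - 2(1-p)\,\sigma^2\Bigl(\frac{1}{m} + \Delta^2\Bigr) = 0 \\
\Longrightarrow\quad p\,m &= n(1-p)(1 + m\Delta^2) \\
\Longrightarrow\quad p &= \frac{n + nm\Delta^2}{n + m + nm\Delta^2} \\
\mathrm{MSE}(\hat{\mu}_p)\Big|_{p \text{ from (4)}}
  &= \frac{\sigma^2(1 + m\Delta^2)}{n + m + nm\Delta^2} \;\le\; \frac{\sigma^2}{n}
\end{align*}
```

With $\Delta$ known, then, the optimally weighted estimator is never worse than $\bar{y}$; it equals $\sigma^2/(n+m)$ at $\Delta = 0$ and approaches $\sigma^2/n$ as $|\Delta|$ grows.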

Of course, because $\Delta$ is an unknown parameter, the value of $p$
providing the minimum MSE is also unknown. Thus, we might consider
substituting $\hat{\Delta}$ into (4) and using the resulting value

    $\hat{p} = (n + nm\hat{\Delta}^2)/(n + m + nm\hat{\Delta}^2)$.

This results in an adaptive estimator $\hat{\mu}_{\hat{p}}$. Because
$\hat{p}$ is a random variable rather than a constant,
$\mathrm{MSE}(\hat{\mu}_{\hat{p}})$ cannot be obtained by substitution into
equation (1).
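
A minimal sketch of the adaptive estimator follows, assuming the same pooled standard deviation $s$ used in the t-statistic. The function name and the sample values are hypothetical and serve only to show how the weight moves between $n/(n+m)$ and 1 as the two sample means separate.

```python
from statistics import fmean

def adaptive_estimate(y, w):
    """Return (estimate, p_hat) for the adaptive weighted average.

    p_hat is equation (4) with Delta replaced by (ybar - wbar)/s,
    where s is the pooled standard deviation (assumed form).
    """
    n, m = len(y), len(w)
    ybar, wbar = fmean(y), fmean(w)
    ss = sum((v - ybar) ** 2 for v in y) + sum((v - wbar) ** 2 for v in w)
    s2 = ss / (n + m - 2)                     # pooled variance estimate
    delta_hat_sq = (ybar - wbar) ** 2 / s2    # only the square is needed
    p_hat = (n + n * m * delta_hat_sq) / (n + m + n * m * delta_hat_sq)
    return p_hat * ybar + (1 - p_hat) * wbar, p_hat

# Hypothetical data: identical sample means, then a grossly biased simulation.
y_obs = [1.0, 2.0, 3.0]
print(adaptive_estimate(y_obs, [1.5, 2.5] * 5))      # weight equals n/(n+m)
print(adaptive_estimate(y_obs, [101.5, 102.5] * 5))  # weight close to 1
```

Unlike the test-based rule, the weight given to the simulation data shrinks continuously as the estimated bias increases.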
IV.  EVALUATION  OF  MSE


At this juncture, we have four estimators to consider in examining
the data integration task. These are:

(a) $\hat{\mu}_{p^*}$, which always uses the real-world and simulation
observations weighted equally,

(b) $\hat{\mu}_1$, which always uses only the real-world observations,

(c) $\tilde{\mu}$, which is based on a test of the validity of the
simulation model,

and (d) $\hat{\mu}_{\hat{p}}$, which is an adaptive estimator.

We note that an investigation of these estimators does not depend on the
actual values of $\mu$ and $\sigma$, since location has no effect on the
results and the bias of any simulation observation is measured in units of
$\sigma$.


In order to compare the performance of these four estimators for a
sample size $(n, m)$, their MSE's must be evaluated for different values of
$\Delta$. This poses no difficulty in the case of the first two estimators
($\hat{\mu}_{p^*}$ and $\hat{\mu}_1$); the required MSE's are given by
equations (1) and (2). Unfortunately, things aren't so easy when considering
the estimators $\tilde{\mu}$ and $\hat{\mu}_{\hat{p}}$. For $\tilde{\mu}$,
we must compute the expected value of

    $[(n\bar{y} + m\bar{w})/(n+m) - \mu]^2$

over the region in $(s, \bar{y}, \bar{w})$-space

    $|(\bar{y} - \bar{w})/s| \le [(n+m)/nm]^{1/2}\, t_{\alpha/2,\,n+m-2}$

and the expected value of $(\bar{y} - \mu)^2$ over the region

    $|(\bar{y} - \bar{w})/s| > [(n+m)/nm]^{1/2}\, t_{\alpha/2,\,n+m-2}$.

For the adaptive estimator $\hat{\mu}_{\hat{p}}$, we must evaluate the
expected value of

    $\left[ \frac{\{n + nm[(\bar{y} - \bar{w})/s]^2\}\,\bar{y} + m\bar{w}}{n + m + nm[(\bar{y} - \bar{w})/s]^2} - \mu \right]^2$.
Because of their complexity, an analytic evaluation of these expected
values is an impossible task. Thus, we must turn to numerical
integration or to Monte Carlo. Since we could not eliminate the need to
evaluate triple integrals, we chose to use Monte Carlo to investigate the
two specific cases of (n=3, m=10) and (n=3, m=50).
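
A Monte Carlo evaluation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the function name, the replication count, the seed, and the choice of $\mu = 0$, $\sigma = 1$ (harmless, since the comparison is location and scale invariant) are all ours, and the critical value `t_crit` is passed in by the caller. No attempt is made to reproduce the paper's tabulated values.

```python
import math
import random
from statistics import fmean

def simulate_relative_mse(n, m, delta, t_crit, reps=20000, seed=1):
    """Estimate each estimator's MSE relative to sigma^2/n by simulation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # assumed without loss of generality
    sq_err = {"pooled": 0.0, "real_only": 0.0, "validity": 0.0, "adaptive": 0.0}
    threshold = math.sqrt((n + m) / (n * m)) * t_crit
    for _ in range(reps):
        y = [rng.gauss(mu, sigma) for _ in range(n)]
        w = [rng.gauss(mu + delta * sigma, sigma) for _ in range(m)]
        ybar, wbar = fmean(y), fmean(w)
        ss = sum((v - ybar) ** 2 for v in y) + sum((v - wbar) ** 2 for v in w)
        s = math.sqrt(ss / (n + m - 2))       # pooled standard deviation
        d_hat = (ybar - wbar) / s             # estimate of Delta (magnitude used)
        pooled = (n * ybar + m * wbar) / (n + m)
        validity = ybar if abs(d_hat) > threshold else pooled
        p_hat = (n + n * m * d_hat ** 2) / (n + m + n * m * d_hat ** 2)
        adaptive = p_hat * ybar + (1 - p_hat) * wbar
        for name, est in (("pooled", pooled), ("real_only", ybar),
                          ("validity", validity), ("adaptive", adaptive)):
            sq_err[name] += (est - mu) ** 2
    base = sigma ** 2 / n                     # exact MSE of ybar, equation (2)
    return {name: total / reps / base for name, total in sq_err.items()}

# Example run for (n=3, m=10) with the t value 1.50 quoted in the text.
for delta in (0.0, 1.0, 3.0):
    print(delta, simulate_relative_mse(3, 10, delta, t_crit=1.50))
```

The pooled and real-world-only columns can be checked against equations (3) and (2), which gives a rough indication of the sampling error in the other two.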

Tables 1 and 2 list the MSE of $\hat{\mu}_{p^*}$ (the pooled mean
estimator), of $\tilde{\mu}$ (the validity test estimator) and of
$\hat{\mu}_{\hat{p}}$ (the adaptive estimator) relative to that of
$\hat{\mu}_1 = \bar{y}$, which is $\sigma^2/3$ in both cases. For the
validity test estimator, five values of $\alpha$ were considered. These were
.01, .05, .10, .20, and $\alpha^*$, where $\alpha^*$ denotes the value of
$\alpha$ which provides an MSE equal to that of the adaptive estimator
$\hat{\mu}_{\hat{p}}$ when $\Delta = 0$. For (n=3, m=10), $\alpha^* = .17$,
which corresponds to a t value of 1.50, while for (n=3, m=50),
$\alpha^* = .12$, which corresponds to a t value of 1.58.
Table 2: MSE's of estimators relative to $\mathrm{MSE}(\hat{\mu}_1) = \mathrm{MSE}(\bar{y})$ for n = 3, m = 50

(MSE's for $\tilde{\mu}$ and $\hat{\mu}_{\hat{p}}$ estimated by Monte Carlo; maximum standard error is 0.05)


V.  DISCUSSION 


As can be seen from Tables 1 and 2, and graphically from Figures 2
and 3, none of the four estimators dominates (or is dominated by) any other
estimator in terms of MSE. This in itself is not surprising. What is
surprising, and somewhat disconcerting, is that the simulation data is useful
(i.e., provides a more accurate estimate of $\mu$) only if $\Delta$ is very
small. For no matter which estimator (other than $\bar{y}$) we adopt, we can
never come out ahead if $|\Delta| > 0$, and in fact we may wind up doing
substantially worse than we may have thought possible.

It is clear that the pooled mean estimator $\hat{\mu}_{p^*}$, with its
unbounded MSE, is not worth considering. By using either the validity test
estimator $\tilde{\mu}$ or the adaptive estimator $\hat{\mu}_{\hat{p}}$, we
will come out ahead, or at least not too far behind, if $|\Delta|$ is either
small or large. It is for moderate values of $|\Delta|$, approximately
$1.0 < |\Delta| < 3.0$, that the worst things happen to us. Therefore,
somewhat with tongue in cheek, we see that the resulting moral is to
construct either a very accurate simulation or a very inaccurate one.

More seriously, though, our results indicate that a test of validity, per
se, is unwarranted (and hazardous) if the data is to be used for parameter
estimation. We can see this from Figures 2 and 3 by examining the results of
a validity test at the usual significance levels of .01 and .05. If we wish
to take a chance on combining real-world and simulation data,
$\hat{\mu}_{\hat{p}}$ appears to be our best bet since it provides reasonable
gains (decreases in MSE) for small $|\Delta|$ and in the worst case does not
substantially penalize us.


Figure 3: MSE's of the estimators relative to $\mathrm{MSE}(\hat{\mu}_1) = \mathrm{MSE}(\bar{y})$
for n = 3, m = 50
In this paper we discuss a topic we label "data integration," which addresses
the problem of combining information from a simulation model with that from
the corresponding real-world situation. Although data integration is related
to simulation validation, it does not focus on a yes/no decision about
whether or not a model is valid. Instead, it is concerned with whether or
not the simulation data is useful.